Abstract

This article describes an algorithm for reducing the intermediate alphabets in cascades of 
finite-state transducers (FSTs). Although the method modifies the component FSTs, there is 
no change in the overall relation described by the whole cascade. No additional information
or special algorithm that could slow down the processing of input is required at runtime. Two
examples from Natural Language Processing are used to illustrate the effect of the algorithm 
on the sizes of the FSTs and their alphabets. With some FSTs the number of arcs and symbols 
shrank considerably. 



1. Introduction 

This article describes an algorithm for reducing the intermediate alphabet occurring in the 
middle of a pair of finite-state transducers (FSTs) that operate in a cascade, i.e., where the first 
FST maps an input string to a number of intermediate strings and the second maps those to 
a number of output strings. With longer cascades, the algorithm can be applied pair-wise to 
all FSTs. Although the method modifies the component FSTs and the component relations that
they describe, there is no change in the overall relation described by the whole cascade. No
additional information or special algorithm that could slow down the processing of input is
required at runtime.

Intermediate alphabet reduction can be beneficial for many practical applications that use
FST cascades. In Natural Language Processing, FSTs are used for many basic steps (Karttunen
et al., 1996; Mohri, 1997), such as phonological (Kaplan & Kay, 1994) and morphological
analysis (Koskenniemi, 1983), part-of-speech disambiguation (Roche & Schabes, 1995; Kempe,
1997; Kempe, 1998), spelling correction (Oflazer, 1996), and shallow parsing (Koskenniemi
et al., 1992; Ait-Mokhtar & Chanod, 1997). Some of these applications, such as shallow
parsing, use FST cascades. Others could be used jointly in a cascade. In these cases, the proposed
method can reduce the sizes of the FSTs.

The described algorithm has been implemented. 
1.1. Conventions

The conventions below are followed in this article. 

Examples and figures: Every example is shown in one or more figures. The first figure 
usually shows the original network or cascade. Possible following figures show modified forms 
of the same example. For example, Example 1 is shown in Figure 1 and Figure 2. 

Finite-state graphs: Every network has one initial state, labeled with number 0, and one or 
more final states marked by double circles. The initial state can also be final. All other state 
numbers and all arc numbers have no meaning for the network but are just used to reference a 
state or an arc from within the text. An arc with n labels designates a set of n arcs with one label 
each that all have the same source and destination. In a symbol pair occurring as an arc label, 
the first symbol is the input and the second the output symbol. For example, in the symbol pair 
a : b, a is the input and b the output symbol. Simple (i.e. unpaired) symbols occurring as an
arc label represent identity pairs. For example, a means a : a. The question mark, "?", denotes
(and matches) all unknown symbols, i.e., all symbols outside the alphabet of the network.
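
The labeling conventions above can be made concrete with a small sketch. It assumes that arcs are stored as (source, input, output, destination) tuples; this representation and the helper name `expand_labels` are illustrative assumptions and do not come from the article.

```python
def expand_labels(src, labels, dst):
    """Expand one drawn arc carrying several labels into single-label arcs.

    A label "a : b" is a symbol pair (input a, output b); a simple label "a"
    stands for the identity pair a : a. (Hypothetical helper for illustration.)
    """
    arcs = []
    for label in labels:
        # partition() yields ("a ", ":", " b") for a pair and ("c", "", "") otherwise
        inp, sep, out = (part.strip() for part in label.partition(":"))
        arcs.append((src, inp, out if sep else inp, dst))
    return arcs


# One drawn arc from state 0 to state 1 with two labels
arcs = expand_labels(0, ["a : b", "c"], 1)
```

Here the two-label arc becomes two arcs, one carrying the pair a : b and one carrying the identity pair c : c.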

Input and output side: Although FSTs are inherently bidirectional, they are often intended
to be used in a given direction. The proposed algorithm is performed with respect to the direction of
application. In this article, the two sides (or tapes or levels) of an FST are referred to as input
side and output side.

2. Previous Work 

The algorithm of intermediate alphabet reduction described below is related to the idea of
label set reduction (Koskenniemi, 1983; Karttunen et al., 1987). The latter is applied to a single
FST or automaton. It groups all arc labels into equivalence classes, regardless of whether these are
atomic labels (e.g. "a"), identity pairs (e.g. "a : a"), or non-identity pairs (e.g. "a : x"). Labels
that always co-occur on arcs with the same source and destination state are put into the same
equivalence class. One label is then selected from every class to represent the class. All other
labels are removed from the alphabet, and the corresponding arcs are removed from the network,
which can lead to a considerable size reduction. Label set reduction is reversible, based on the
information about the equivalence classes. At runtime, this information is required together
with a special algorithm to interpret every label in the network as the set of labels in the
corresponding equivalence class. For example, if the label a represents the class {a, a : b, b, c : z},
then it must map c, occurring at the input, to z.
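
To illustrate why label set reduction needs extra information at runtime, the following sketch interprets a representative label as its whole class, using the class from the example above. The table layout and the function name `outputs_for` are assumptions made for illustration only.

```python
# Representative label -> the (input, output) pairs of its class.
# The class {a, a : b, b, c : z} is taken from the example in the text;
# simple labels are written out as identity pairs.
LABEL_CLASSES = {
    "a": [("a", "a"), ("a", "b"), ("b", "b"), ("c", "z")],
}


def outputs_for(label, input_symbol):
    """Return every output that an arc labeled `label` yields for the input."""
    return [out for inp, out in LABEL_CLASSES[label] if inp == input_symbol]


mapped = outputs_for("a", "c")
```

Every arc traversal has to go through such a lookup, which is the runtime overhead that intermediate alphabet reduction avoids.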

3. Reduction of Intermediate Alphabets 

The algorithm of intermediate alphabet reduction is applied to a pair of FSTs that operate in 
a cascade rather than to a single FST. It reduces the intermediate alphabet between the two FSTs 
without necessarily reducing the label sets of the FSTs. With longer cascades, the algorithm can 
be applied pair-wise to all FSTs. Although the component FSTs and component relations that 
they describe are (irreversibly) modified, there is no change in the overall relation described 
by the whole cascade. No additional information or special algorithm that could decrease
the processing speed is required at runtime. The fact that every intermediate symbol actually
represents a set of (one or more) symbols can be neglected at that point. Every symbol will be
considered at runtime just as itself.
3.1. Alphabet Reduction in Transducer Pairs



We will first describe the algorithm for an FST pair where the two FSTs, $T_1$ and $T_2$, operate
in a cascade, i.e., $T_1$ maps an input string to a number of intermediate strings which, in turn, are
mapped by $T_2$ to a number of output strings.



Equivalence classes: $\{\alpha_0\}$, $\{\alpha_1, \alpha_2\}$, $\{\alpha_3, \alpha_4\}$
Figure 1: Transducer pair (Example 1)



Example 1 shows an FST pair and part of its input, intermediate, and output alphabets (Fig. 1).
Suppose both intermediate symbols $\alpha_1$ and $\alpha_2$ are always mapped to the same output symbol,
which can be y, v, or x depending on the context. This means that $\alpha_1$ and $\alpha_2$ constitute an
equivalence class. There may be another class formed by $\alpha_3$ and $\alpha_4$. If we are not interested in the
actual intermediate symbols but only in the final output, we can select one member symbol of
every class to represent the class, and replace all other symbols by the representative of their
class. In Example 1, this means that $\alpha_2$ is replaced by $\alpha_1$, and $\alpha_4$ by $\alpha_3$ (Fig. 2).



Figure 2: Transducer pair with reduced intermediate alphabet (Example 1)



The algorithm works as follows. First, equivalence classes are constituted among the input
symbols of $T_2$ (Fig. 3) [1]. For this purpose, all symbols are initially put into one single class,
which is then partitioned further and further as the arcs of $T_2$ are inspected. The construction of
equivalence classes terminates either when all arcs have been inspected or when the maximal
partitioning (into singleton classes) is reached.

Two symbols, $\alpha_i$ and $\alpha_j$, are considered equivalent if for every arc with $\alpha_i$ as input symbol,
there is another arc with $\alpha_j$ as input symbol and vice versa, such that both arcs have the same
source and destination state and the same output symbol. In Example 2, we constitute the
equivalence classes $\{\alpha_0\}$, $\{\alpha_1, \alpha_2\}$, and $\{\alpha_3, \alpha_4\}$ (Fig. 3). Here, $\alpha_0$ constitutes a class on its
own because it first co-occurs with $\alpha_1$ and $\alpha_2$ in the arc set {100, 101, 102}, and later with $\alpha_3$
and $\alpha_4$ in the arc set {120, 121, 122}.

[1] In Figures 3, 4, and 7, only transitions and states that are relevant for the current purpose are
represented by solid arcs and circles, and all the others by dashed arcs and circles.




Equivalence classes: $\{\alpha_0\}$, $\{\alpha_1, \alpha_2\}$, $\{\alpha_3, \alpha_4\}$
Figure 3: Second transducer of a pair (Example 2)
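
The equivalence criterion can be sketched in a few lines. This is a minimal illustration, not the article's implementation: instead of partitioning one initial class step by step, it groups symbols by the set of (source, output, destination) triples they label, which yields the same classes under the definition given in the text. The tuple representation of arcs and the state numbers and output symbols of the sample data are assumptions, since the figure of Example 2 is not reproduced here.

```python
from collections import defaultdict


def equivalence_classes(arcs):
    """Group the input symbols of an FST into equivalence classes.

    arcs: iterable of (source, input, output, destination) tuples.
    Two symbols are equivalent when they label exactly the same set of
    (source, output, destination) triples.
    """
    signature = defaultdict(set)
    for src, inp, out, dst in arcs:
        signature[inp].add((src, out, dst))
    classes = defaultdict(list)
    for symbol in sorted(signature):
        classes[frozenset(signature[symbol])].append(symbol)
    return sorted(classes.values())


# Hypothetical arcs in the spirit of Example 2: a0 co-occurs with a1 and a2
# between states 1 and 2, and with a3 and a4 between states 3 and 4.
t2 = [
    (1, "a0", "x", 2), (1, "a1", "x", 2), (1, "a2", "x", 2),
    (3, "a0", "y", 4), (3, "a3", "y", 4), (3, "a4", "y", 4),
]
classes = equivalence_classes(t2)
```

On this data the sketch yields the three classes named in the text, with a0 in a class of its own. A fuller version would additionally keep the empty string and the unknown symbol "?" in singleton classes.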



Subsequently, all occurrences of intermediate symbols are replaced by the representative
of their class. In Example 2, we selected the first member of each class as its representative.
The replacement must be performed both on the output side of $T_1$ and on the input side of $T_2$.
Figure 4 shows the effect of this replacement on $T_2$ in Example 2.




Figure 4: Second transducer of a pair with reduced intermediate alphabet (Example 2) 
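
The replacement step can be sketched as follows, assuming the equivalence classes are already known and that arcs are stored as (source, input, output, destination) tuples. The function name `reduce_pair` and the sample arcs are hypothetical.

```python
def reduce_pair(t1, t2, classes):
    """Replace each intermediate symbol by the first member of its class.

    t1, t2: lists of (source, input, output, destination) arcs.
    classes: equivalence classes of the intermediate symbols (lists).
    Returns the modified arc lists; arcs that become identical are merged.
    """
    rep = {symbol: cls[0] for cls in classes for symbol in cls}
    # Output side of the first FST
    new_t1 = sorted({(s, i, rep.get(o, o), d) for s, i, o, d in t1})
    # Input side of the second FST
    new_t2 = sorted({(s, rep.get(i, i), o, d) for s, i, o, d in t2})
    return new_t1, new_t2


t1 = [(0, "p", "a1", 1), (0, "q", "a2", 1)]
t2 = [(0, "a1", "x", 1), (0, "a2", "x", 1)]
new_t1, new_t2 = reduce_pair(t1, t2, [["a1", "a2"]])
```

In this small case the two arcs of the second FST collapse into one, while the first FST keeps both arcs but now emits only a1. The composed relation (p to x, q to x) is unchanged.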

The algorithm can be applied to any pair of FSTs: they can be ambiguous, and they can contain
"?", the unknown symbol, or $\varepsilon$, the empty string, or even $\varepsilon$-loops. Although $\varepsilon$ and "?" cannot be
merged with other symbols, this places no restriction on the type of the FSTs.

3.2. Alphabet Reduction in Transducer Cascades 

In a cascade, intermediate alphabet reduction can be applied pair-wise to all FSTs, starting
from the end of the cascade. Example 3 shows a cascade of four FSTs with the intermediate
alphabets A, B, and $\Gamma$ (Fig. 5). The reduction is first applied to the last intermediate alphabet,
$\Gamma$, between $T_3$ and $T_4$. Suppose there are three equivalence classes in $\Gamma$, namely $\{\gamma_0, \gamma_1\}$,
$\{\gamma_2\}$, and $\{\gamma_3, \gamma_4\}$. According to the above method, the class $\{\gamma_0, \gamma_1\}$ can be represented by
$\gamma_0$, $\{\gamma_2\}$ by $\gamma_2$, and $\{\gamma_3, \gamma_4\}$ by $\gamma_3$. This means that all occurrences of $\gamma_1$ are replaced by $\gamma_0$ and all
occurrences of $\gamma_4$ by $\gamma_3$, both on the output side of $T_3$ and on the input side of $T_4$ (Fig. 5, 6).
Consequently, $\beta_1$ in the preceding alphabet, B, will now be mapped to $\gamma_0$ instead of $\gamma_1$, and $\beta_2$
to $\gamma_3$ instead of $\gamma_4$. The latter mapping actually exists already.

Subsequently, the preceding intermediate alphabet, B, is reduced. It may contain two
equivalence classes, $\{\beta_0, \beta_1\}$ and $\{\beta_2\}$. Both members of $\{\beta_0, \beta_1\}$ are at present mapped either to
$\gamma_0$ (previously $\{\gamma_0, \gamma_1\}$) or to $\gamma_2$, depending on the context. Note that the class $\{\beta_0, \beta_1\}$ can only
be constituted if the alphabet $\Gamma$ has previously been reduced and $\{\gamma_0, \gamma_1\}$ has been replaced by $\gamma_0$.
Finally, we reduce the intermediate alphabet A, based on its equivalence classes $\{\alpha_0\}$, $\{\alpha_1, \alpha_2\}$,
and $\{\alpha_3\}$ (Fig. 6).

Equivalence classes in different alphabets:
A: $\{\{\alpha_0\}, \{\alpha_1, \alpha_2\}, \{\alpha_3\}\}$,  B: $\{\{\beta_0, \beta_1\}, \{\beta_2\}\}$,  $\Gamma$: $\{\{\gamma_0, \gamma_1\}, \{\gamma_2\}, \{\gamma_3, \gamma_4\}\}$
Figure 5: Transducer cascade (Example 3)





Solid lines mark modified relations.
Figure 6: Transducer cascade with reduced intermediate alphabets (Example 3)



The best overall reduction of the cascade is guaranteed if the algorithm starts with the last
intermediate alphabet and processes all alphabets in reverse order, for the following
reason. Suppose an FST inside a cascade maps $\alpha_0$ to $\beta_0$ and $\alpha_1$ to $\beta_1$ on two arcs with the
same source and destination state (Fig. 7). If the alphabet B is processed first and $\beta_0$ and $\beta_1$
are reduced to one symbol (e.g. $\beta_0$), then $\alpha_0$ and $\alpha_1$ will have equal output symbols and may
therefore become reducible to one symbol (e.g. $\alpha_0$), depending on their other output symbols in
this FST. If $\beta_0$ and $\beta_1$ cannot be reduced, then $\alpha_0$ and $\alpha_1$ cannot subsequently be reduced either, because
they have different output symbols, at least in this case. This means that reducing B first is either
beneficial or neutral for the subsequent reduction of A. Not reducing B first is always neutral
for A. Therefore, the intermediate alphabets of a cascade should all be reduced in reverse order.
$\beta_0, \beta_1 \in B$

Figure 7: Transducer inside a cascade (Example 4)
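
The effect of the processing order can be checked on a toy cascade modeled on Example 4. The sketch below is an illustration under assumptions: arcs are (source, input, output, destination) tuples, classes are found by grouping symbols with identical (source, output, destination) sets, and all function names, state numbers, and the outer symbols p, q, z are hypothetical.

```python
from collections import defaultdict


def classes_of(arcs):
    """Equivalence classes among the input symbols of one FST."""
    sig = defaultdict(set)
    for src, inp, out, dst in arcs:
        sig[inp].add((src, out, dst))
    groups = defaultdict(list)
    for sym in sorted(sig):
        groups[frozenset(sig[sym])].append(sym)
    return list(groups.values())


def reduce_pair(t1, t2):
    """Reduce the intermediate alphabet between two adjacent FSTs."""
    rep = {s: cls[0] for cls in classes_of(t2) for s in cls}
    t1 = sorted({(s, i, rep.get(o, o), d) for s, i, o, d in t1})
    t2 = sorted({(s, rep[i], o, d) for s, i, o, d in t2})
    return t1, t2


def reduce_cascade(fsts, order):
    """Apply reduce_pair to the pairs (fsts[k], fsts[k+1]) in the given order."""
    fsts = list(fsts)
    for k in order:
        fsts[k], fsts[k + 1] = reduce_pair(fsts[k], fsts[k + 1])
    return fsts


def input_alphabet(arcs):
    return {inp for _, inp, _, _ in arcs}


# The middle FST maps a0 to b0 and a1 to b1 on parallel arcs (as in Example 4);
# the last FST maps both b0 and b1 to z.
t1 = [(0, "p", "a0", 1), (0, "q", "a1", 1)]
t2 = [(0, "a0", "b0", 1), (0, "a1", "b1", 1)]
t3 = [(0, "b0", "z", 1), (0, "b1", "z", 1)]

reverse = reduce_cascade([t1, t2, t3], order=[1, 0])  # B first, then A
forward = reduce_cascade([t1, t2, t3], order=[0, 1])  # A first, then B
```

With the reverse order both intermediate alphabets shrink to a single symbol. With the forward order, a0 and a1 are inspected while their outputs still differ, so A keeps both symbols.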

4. Complexity 

The running time complexity of constructing equivalence classes is in the general case
quadratic in the number of outgoing arcs of every state of the second FST, $T_2$, of a pair:

$$O\Bigl(\sum_{q \in Q_2} |E(q)|^2\Bigr) \qquad (1)$$

where $|E(q)|$ is the number of outgoing arcs of state q. This is because in general every arc of
q must be compared to every other arc of q. In the extremely unlikely worst case when all arcs
have the same source state, this complexity is quadratic in the total number of arcs in $T_2$:

$$O\bigl(|E_2|^2\bigr) \qquad (2)$$

The running time complexity of replacing symbols by the representatives of their equivalence
class is in any case linear in the total number of arcs in $T_1$ and $T_2$:

$$O\bigl(|E_1| + |E_2|\bigr) \qquad (3)$$

The space complexity of constructing equivalence classes is linear in the size of the
intermediate alphabet that is to be reduced:

$$O\bigl(|\Sigma^{mid}|\bigr) \qquad (4)$$

The replacement of symbols by the representatives of their class requires no additional memory
space.
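
The relation between the general bound (1) and the worst-case bound (2) can be spelled out in one line. The notation is assumed here: $|E(q)|$ is the number of outgoing arcs of state q, $Q_2$ the state set, and $E_2$ the arc set of the second FST. Since every arc has exactly one source state, the per-state counts sum to the total number of arcs:

```latex
\sum_{q \in Q_2} |E(q)|^2
  \;\le\; \Bigl( \sum_{q \in Q_2} |E(q)| \Bigr)^2
  \;=\; |E_2|^2
```

Equality holds when all arcs leave the same state, which is the worst case named above. When the arcs are spread over many states, the left-hand side is much smaller.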




5. Evaluation 

Table 1 shows the effect of intermediate alphabet reduction on the sizes of FSTs and their 
alphabets in a cascade. The twelve FSTs are used for shallow parsing of French text, e.g., for 
the marking of noun phrases, clauses, and syntactic relations such as subject, inverted subject, 
or object (Ait-Mokhtar & Chanod, 1997). An input string to the cascade consists of surface 
word forms, lemmas, and part-of-speech tags of a whole sentence. The output is ambiguous
and consists alternatively of a parse (only one per sentence) or of a syntactic relation such as
"SUBJ(Jean, mange)" (several per sentence).

With some FSTs the reduction was considerable. For example, the number of arcs of the
largest FST (#10) was reduced from 1 156 053 to 498 506. In some other cases no reduction
was possible. For the whole cascade the number of arcs shrank from 2 704 047 to 1 731 831,
which represents a reduction of 36%.
       |            originally             |          after reduction
       |                      #symbols     |                      #symbols
 FST   | #states      #arcs  input  output | #states      #arcs   input   output
-------+-----------------------------------+------------------------------------
 1     |  10 487    404 903    138     137 |  10 487    404 903     138      137
 2     |     604     28 569    136     135 |     604     28 569     136      135
 3     |  27 704    225 215     11      10 |  27 704    225 215      11       10
 4     |   3 613     61 259     23      23 |   3 613     61 259      23       23
 5     |   1 276    128 222    146     142 |   1 276    124 754*    143*     142
 6     |   3 293     29 079     12      12 |   3 293     29 079      12       12
 7     |   5 544    166 704     34      33 |   5 544     90 024*     19*      18*
 8     |     396     19 008     48      48 |     396     12 276*     31*      31*
 9     |   7 009    370 419     54      54 |   7 009    204 411*     30*      27*
 10    |   6 033  1 156 053    158     158 |   6 033    498 506*     66*      65*
 11    |     573    114 328    171     171 |     573     52 801*     78*      45*
 12    |       2        288    144      16 |       2         34*     17*      16
-------+-----------------------------------+------------------------------------
 total |  66 534  2 704 047                |  66 534  1 731 831*

Values marked * were modified by the reduction.

Table 1: Sizes of transducers and of their alphabets in a cascade
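
The totals and the percentage quoted in the text can be re-derived from the arc counts in Table 1. The short check below copies the two #arcs columns from the table; the variable names are chosen for illustration.

```python
# Arc counts of the twelve FSTs, copied from Table 1.
arcs_before = [404903, 28569, 225215, 61259, 128222, 29079,
               166704, 19008, 370419, 1156053, 114328, 288]
arcs_after = [404903, 28569, 225215, 61259, 124754, 29079,
              90024, 12276, 204411, 498506, 52801, 34]

total_before = sum(arcs_before)
total_after = sum(arcs_after)

# Relative reduction of the number of arcs over the whole cascade
reduction = (total_before - total_after) / total_before
print(total_before, total_after, f"{reduction:.0%}")
```

Both sums agree with the "total" row of the table, and the relative reduction rounds to the 36% stated in the text.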



Each of these FSTs has its own alphabet, which can contain "?", matching any symbol that
is not explicitly mentioned in the alphabet of the FST. Therefore, an FST can have a different
number of output symbols than the following FST has input symbols, because a given symbol
may be part of the unknown alphabet in one FST but not in the other.



                |            originally            |         after reduction
 Lexicon  FST   | #states    #arcs   input  output | #states    #arcs     input   output
----------------+----------------------------------+------------------------------------
 English  T     |  62 120  156 757     224     298 |
          T1    |  75 900  191 687     224   8 874 |  73 780*  181 829*     224    3 042*
          T2    |  16 748   36 737   8 729     153 |  16 748    24 483*   2 897*     153
          T1+T2 |  92 648  228 424                 |  90 528*  206 312*
----------------+----------------------------------+------------------------------------
 French   T     |  55 725  130 139     224     269 |
          T1    |  61 467  164 819     224  11 420 |  59 697*  159 187*     224    2 680*
          T2    |   6 814   59 587  11 241      90 |   6 814    42 007*   2 501*      90
          T1+T2 |  68 281  224 406                 |  66 511*  201 194*

T: original FST; T1, T2: first and second FST from factorization; T1+T2: sum of both.
The input and output columns give the number of symbols. Values marked * were modified by the reduction.

Table 2: Sizes of transducers and of their alphabets in factorization



Intermediate alphabet reduction can also be useful with the factorization of a finitely
ambiguous FST into two FSTs, $T_1$ and $T_2$, that operate in a cascade (Kempe, 2000). Here, $T_1$ maps
any input symbol that originally has ambiguous output to a unique intermediate symbol, which
is then mapped by $T_2$ to a number of different output symbols. $T_1$ is unambiguous. $T_2$ retains
the ambiguity of the original FST, but it is fail-safe with respect to $T_1$. This means that the application of $T_2$
to the output of $T_1$ never leads to a state that does not provide a transition for the next input
symbol, and always terminates in a final state. This factorization can create many redundant
intermediate symbols, but their number can be reduced by the above algorithm. Table 2 shows
the effect of alphabet reduction on FSTs resulting from the factorization of an English and a
French lexicon.

Since $T_1$ is unambiguous, it can be further factorized into a left- and a right-sequential FST
that jointly represent a bimachine (Schützenberger, 1961). The intermediate alphabet of a
bimachine, however, can be limited to the necessary minimum already during factorization (Elgot
& Mezei, 1965; Berstel, 1979; Reutenauer & Schützenberger, 1991; Roche, 1997), so that the
above algorithm is of no use in this case.

6. Conclusion 

The article described an algorithm for reducing the intermediate alphabets in an FST cascade. 
The method modifies the component FSTs but not the overall relation described by the whole 
cascade. The actual benefit consists in a reduction of the sizes of the FSTs. 

Two examples from Natural Language Processing have been used to illustrate the effect of 
the alphabet reduction on the sizes of FSTs and their alphabets, namely a cascade for shallow 
parsing of French text and FST pairs resulting from the factorization of lexica. With some FSTs 
the number of arcs and symbols shrank considerably. 